Abstract


A spatial auditory display was designed for 
separating the multiple communication channels 
usually heard over one ear to different virtual 
auditory positions. The single 19” rack mount 
device utilizes digital filtering algorithms to 
separate up to four communication channels. 
The filters use four different binaural transfer 
functions, synthesized from actual outer ear 
measurements, to impose localization cues on 
the incoming sound. Hardware design features 
include "fail-safe" operation in the case of power
loss, and microphone/headset interfaces to the 
mobile launch communication system in use at 
NASA Kennedy Space Center. An experiment 
designed to verify the intelligibility advantage of 
the display used 130 different call signs taken 
from the communications protocol used at NASA 
KSC. A 6 to 7 dB intelligibility advantage was 
found when multiple channels were spatially 
displayed, compared to monaural listening. The 
findings suggest that the use of a spatial 
auditory display could enhance both 
occupational and operational safety and 
efficiency of NASA operations. (Supported by 
NASA Ames and NASA KSC Director’s 
Discretionary Funding). 


1. INTRODUCTION 

1.1 Application to NASA communication 
systems. 

During fiscal year 1992, NASA Director's 
Discretionary Funding was received from 
Ames Research Center (ARC) and John F. 
Kennedy Space Center (KSC) by Drs. E. M. 
Wenzel and D. R. Begault, to develop a four 
channel spatial auditory display for 
application to multiple channel speech 
communication systems in use at KSC. A 
previously specified design (Begault & 
Wenzel, 1990; Begault, 1992a) was used to 
fabricate a prototype device, which was
completed in February, 1993. 1 This
prototype places four different 
communication channels in virtual auditory 
positions about the listener, by digitally 
filtering each input channel with binaural 
head-related transfer function (HRTF) data. 
Listening over headphones, one has a 
spatial sense of each channel originating 
from a unique position outside the head; i.e., 
as if four people were standing about you, 
speaking from different directions. 
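
The filtering step described above can be sketched as a pair of convolutions, one per ear. This is a minimal illustration and not the prototype's firmware: the function names and the three-tap impulse responses are hypothetical placeholders, whereas a real display would use measured head-related impulse responses (the time-domain form of the HRTF).

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def spatialize(mono, hrir_left, hrir_right):
    """Filter one mono channel into a (left, right) headphone pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Hypothetical impulse responses for a source on the listener's right:
# the right ear receives the sound earlier and louder than the left.
HRIR_RIGHT_EAR = [1.0, 0.3, 0.1]
HRIR_LEFT_EAR = [0.0, 0.5, 0.2]

left, right = spatialize([1.0, 0.0, 0.0, 0.0], HRIR_LEFT_EAR, HRIR_RIGHT_EAR)
```

In a four-channel display, each input channel would be passed through its own pair of filters and the four left outputs and four right outputs would then be summed.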

Input channels to the spatial auditory display 
can be assigned to any position because the 
design uses four removable EPROMs 2 , 
with each EPROM corresponding to a 
particular target position. The EPROMs 
themselves can contain a binaural HRTF for 
any given position and measured ear. 
Hence, an important research question is to 
determine which four positions would be 
optimal for speech intelligibility of multiple 
sound sources. To begin to answer this 
question, the current investigation focused 
on what single spatialized azimuth position 
yielded maximal intelligibility against noise. 
This was accomplished by measuring 
intelligibility thresholds at 30° azimuth 
increments. Intelligibility is defined here as 
correct identification of a spatialized call sign 
(signal) against diotic 3 speech babble 
(noise). 

The KSC communications handbook 
(NASA-KSC, 1991) indicates a list of over 
3000 call signs, most of which are spoken as 
four individual letters, e.g., "NTOC".
Communication personnel who monitor 
multiple radio frequencies must be able to 
hear these four letters clearly against 
speech. Speech babble has been used as a
noise source in several studies investigating
binaural hearing in communication systems
contexts (e.g., Pollack & Pickett, 1958). This
study concludes with a first approximation of
the answer to which HRTF positions are best
used in the filter EPROMs within the
prototype.

1 Tom Erbe (Mills College, Sterling Software)
implemented the firmware and hardware design
into the prototype.

2 EPROM - erasable-programmable-read-only
memory chip.

3 "Diotic" playback is defined as a single audio
channel presented to both ears.

1.2 Binaural advantages and speech 
intelligibility. 

The relationship between binaural hearing 
and the development of improved 
communication systems has been 
understood for over 45 years (Licklider, 
1948; see reviews in Blauert, 1983; Zurek, 
1993). As opposed to monotic (one ear)
listening, the typical situation in
communications operations, binaural
listening allows a listener to use head-shadow
and binaural interaction advantages
simultaneously (Zurek, 1993). The head-shadow
advantage is an acoustical
phenomenon, caused by the interaural level
differences that occur when a sound moves
closer to one ear relative to the other.
Because lower frequencies diffract around
the head from the near ear to the far ear,
only frequencies above approximately
1.5 kHz are shadowed in this way. The
binaural interaction advantage is a
psychoacoustic phenomenon due to the
auditory system's comparison of binaurally
received signals (Levitt & Rabiner, 1967;
Zurek, 1993).

Many studies have focused on binaural
advantages both for detecting a signal
against noise (the binaural masking level 
difference, or BMLD), and for improving 
speech intelligibility (the binaural intelligibility 
level difference, or BILD). Studies of BMLDs 
and BILDs involve manipulation of signal 
processing variables affecting either signal, 
noise, or both. The manipulation can involve 
phase inversion, time delay, and/or filtering. 

Recently, speech intelligibility studies by 
Bronkhorst and Plomp (1988; 1992) have 
used a mannequin head to impose the 
filtering effects of the HRTF on both signal 
and noise sources. The HRTFs were used in 
either an unaltered condition, or with either 
time or amplitude components removed. 
Their results, summarized in Figure 1, show
a 6 to 10 dB advantage with the signal at 0°
azimuth and speech-spectrum noise moved
off axis, compared to the condition where
speech and noise originated from the same
position. Figure 1 also shows lower BILDs
when either interaural time or amplitude
differences are removed from the stimuli.
This suggested the inclusion of HRTF 
filtering within a binaural display for speech 
communication systems (ref. Begault & 
Wenzel, 1990; Begault & Wenzel, 1992). 
According to a model proposed by Zurek 
(1993), based on averaged HRTFs specified 
in Shaw & Vaillancourt (1985), the average 
binaural advantage (speech signal fixed at 
0°, noise uniformly distributed across all 
azimuths, head free to move) is around 5 
dB, with head shadowing contributing about 
3 dB and binaural-interaction about 2 dB. 



Figure 1. Data from Bronkhorst and Plomp
(1988) for speech intelligibility gain. All
stimuli were recorded with a mannequin
head. Speech signal fixed at 0°; noise
moved along azimuth at 0° elevation. FF =
data including effects of the HRTF; dT =
same data with binaural amplitude
differences removed; dL = same data but
with binaural time differences removed.

Another advantage for binaural speech 
reception relates to the ability to switch 
voluntarily between multiple channels, or 
"streams", of information (Bregman, 1990; 
Deutsch, 1983). The improvement in the
detection of a desired speech signal against
multiple speakers, commonly referred to as
the "cocktail party effect" (Cherry, 1953;
Cherry & Taylor, 1954), is explained by
Bregman (1990) as a form of auditory 
stream segregation. This situation was found 
to parallel the multiple channel listening 
requirements of communication personnel, 
such as test directors (NTDs) at KSC. 

2. METHOD

2.1 Stimuli. 


The signal portion of the stimulus was drawn 
from a list of 130 four letter call signs, 
selected from the KSC communication 
handbook (NASA-KSC, 1991). The 130 call 
signs used in the experiment were selected 
randomly so that groups of five began with a 
unique letter of the alphabet. A single male
voice was used, with each letter of the call
sign spoken discontinuously over a duration
of about two seconds. Recordings took
place in a sound-proof booth, using an AKG
C451-EB microphone at a distance of 6 
inches. Once digitized, each call sign 
combination was normalized in amplitude, 
and then scaled to have equal long-term 
r.m.s. measurement values. 
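
The level equalization described above can be illustrated as follows. This is a minimal sketch: the function names, sample values, and the target r.m.s. are hypothetical and are not taken from the study.

```python
import math


def rms(samples):
    """Root-mean-square value over the whole recording (long-term r.m.s.)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_to_rms(samples, target_rms):
    """Apply one gain to the whole recording so its r.m.s. equals target_rms."""
    gain = target_rms / rms(samples)
    return [s * gain for s in samples]


# Two toy "recordings" with different levels (hypothetical values).
call_sign_a = [0.5, -0.5, 0.5, -0.5]
call_sign_b = [0.1, -0.2, 0.1, -0.2]

TARGET_RMS = 0.25  # assumed common target, not a value from the study
equalized = [scale_to_rms(c, TARGET_RMS) for c in (call_sign_a, call_sign_b)]
```

Because a single gain is applied to each whole recording, the relative levels of the four letters within a call sign are preserved.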

The speech babble used for the noise 
portion of the stimulus consisted of multiple 
layers of voices: two layers were from 
different airport control tower frequencies, 
containing both female and male voices, 
with silent intervals of more than 0.2 seconds
deleted; and two recordings of different male 
voices reading technical repair manuals, one 
played backwards, the other pitch shifted 
upwards 4 semitones. The result was a 
dense speech layer in which words could 
occasionally be distinguished, but semantic 
content was lost. 


The noise and speech were digitally stored 
as separate channels of stereo sound files 
(see Figure 2), using an Apple Macintosh II 
and Digidesign's Pro Tools hardware and
software. The duration of each sound file 
used in each stimulus presentation was 
adjusted to 5 seconds, with the noise 
channel faded in and out over the first and 
last 0.5 seconds. The signal was always 
presented 1.5 seconds into the sound file,
allowing subjects to predict its onset. 
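
The timing of one stimulus file can be sketched as below. The sample rate, the linear fade shape, and all names are assumptions made for illustration; only the 5-second duration, the 0.5-second fades, and the 1.5-second signal onset come from the description above.

```python
SAMPLE_RATE = 1000      # Hz; assumed for the sketch, not stated in the paper
FILE_SECONDS = 5.0
FADE_SECONDS = 0.5
SIGNAL_ONSET_SECONDS = 1.5


def fade_noise(noise):
    """Apply a linear fade-in and fade-out (the fade shape is an assumption)."""
    n_fade = int(FADE_SECONDS * SAMPLE_RATE)
    faded = list(noise)
    for i in range(n_fade):
        ramp = i / n_fade
        faded[i] *= ramp
        faded[-1 - i] *= ramp
    return faded


def place_signal(signal):
    """Pad the call sign with silence so that it starts at the fixed onset."""
    total = int(FILE_SECONDS * SAMPLE_RATE)
    onset = int(SIGNAL_ONSET_SECONDS * SAMPLE_RATE)
    if onset + len(signal) > total:
        raise ValueError("signal does not fit in the sound file")
    track = [0.0] * total
    track[onset:onset + len(signal)] = signal
    return track


noise_track = fade_noise([1.0] * int(FILE_SECONDS * SAMPLE_RATE))
signal_track = place_signal([0.5] * (2 * SAMPLE_RATE))  # ~2 s call sign
stimulus = list(zip(noise_track, signal_track))         # two-channel file
```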



Figure 2. Stimulus soundfile arrangement.
Signal: diotic or spatialized call sign, whose
level staircases downward. Noise: diotic
speech babble, whose level remains fixed.
Time axis: 0 to 5 seconds.


Each of the 130 separate noise-signal sound 
files was played through signal processing 
software and hardware, using a Crystal 
River Engineering Convolvotron that also 
served as the experimental software host 
computer (see Wenzel, 1992, for additional 
information on the hardware). Upon 
playback, the Convolvotron passed the 
speech babble channel unaltered to both 
ears. Mixed in with this noise was the two-channel
signal, after software intensity
scaling and HRTF-based spatialization to 
azimuths at 30 degree increments between 
30° - 330° (all at 0° elevation). A diotic 
control condition was also used for the 
signal, where the spatialization was 
bypassed and only intensity scaling was 
used. 
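
The mixing of the scaled signal with the diotic noise can be sketched as follows. The function names and sample values are hypothetical; the sketch assumes that intensity scaling is a single gain, expressed in dB, applied to both signal channels.

```python
def db_to_gain(level_db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)


def mix_trial(noise, signal_left, signal_right, signal_level_db):
    """Add the scaled two-channel signal to diotic noise.

    The same noise samples go to both ears (diotic); only the signal
    carries interaural differences.
    """
    g = db_to_gain(signal_level_db)
    left = [n + g * s for n, s in zip(noise, signal_left)]
    right = [n + g * s for n, s in zip(noise, signal_right)]
    return left, right


# Hypothetical samples; a -20 dB signal is scaled by a factor of 0.1.
left, right = mix_trial([0.2, 0.2], [1.0, 0.0], [0.5, 0.0], -20.0)
```

For the diotic control condition, the same samples would be passed as both `signal_left` and `signal_right`.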

The minimum-phase HRTFs used for the 
spatialization were reconstructed from actual 
HRTF measurements as described in Kistler 
& Wightman (1992). The original 
measurements used were of one subject 
(SDO in Wightman & Kistler, 1989), with the 
headphone frequency response (Sennheiser
HD-430) divided out of the HRTF. Although
the same model of headphone was used for
the subjects in this experiment, non-linearities
in reproducing the HRTF were
introduced as a result of the interaction 
between different pinnae and the headphone 
chambers. Data on localization error of 
speech with non-individualized HRTFs can 
be found in Begault & Wenzel (1991) and 
Begault (1992b). 

2.2 Subjects 

Five subjects (4 males, 1 female) were paid
$5.59 an hour to participate in the study over
two three-hour sessions. This was the "naive
subjects" group in that they had no exposure 
to the call sign list. Another group of 3 lab 
personnel (3 males) who had previous 
exposure to the call sign list constituted the 
"experienced subjects" group; their data is 
analyzed separately from the naive subject 
group. This group included a subject whose 
voice was used in the signal. 

All subjects were evaluated for normal 
hearing from 0.1 - 8 kHz in a pure tone 
audiometer test. Subjects were given a 
training session before starting the 
experiment to familiarize themselves with 
the computer, the time when to expect the 
signal in relation to the noise, and the 
procedure for entering responses. This
training session consisted of a dummy block 
where the level of the signal was clearly 
audible against the noise, and was never 
scaled. The formal blocks were begun after 
approximately 20 trials. 


2.3 Procedure

Software was developed by Phil Stone
(Sterling Software) for presenting stimuli and
gathering data from subjects using an
interleaved, transformed up-down
"staircase" method (Levitt, 1970). The
software varied the level of the signal
against the noise, starting with a maximum
step size of 6 dB and decreasing to a
minimum step size of 1 dB. The response
sequences were evaluated in such a way as
to determine the threshold at a 70.7%
probability level (a "2 up, 1 down"
procedure).

The decibel levels of the diotic stimuli
and the spatialized stimuli were considered
to be equal with reference to the long-term
r.m.s. value of speech-spectrum noise
filtered by a left ear 0° HRTF (obtained from
the same HRTF set used for the other
spatialized positions). The playback level
was around 55 dB SPL when the noise and
0° HRTF-filtered calibration signals were
played simultaneously.

Six blocks were administered to each
subject over three or four days, with each
block containing four staircases randomly
chosen from the 11 possible spatial
positions or the one diotic signal condition.
The four staircases within each block were
presented randomly, as were the 130 call
sign-speech babble sound files used for a
particular stimulus block. The staircases
within the blocks were arranged so that ten
threshold values were obtained from each
subject for each spatial condition, and the
diotic condition. No block contained two
simultaneous staircases for the same spatial
condition of the signal.

Upon hearing the stimulus, the subject typed
the four letters they thought they had heard
onto a computer keyboard, and then after a
short pause the software would present the
next trial. The duration to complete each
block of four staircases was about 15-20
minutes. Testing was administered in a
sound-proof booth. No feedback was given
as to the correct identification of the call
signs; the subjects were only notified when
the 20 staircases within a particular block (4
spatial conditions times 5 staircases) were
completed.

3. RESULTS


Figure 3 summarizes the data for the five
naive subjects, and Figure 4 summarizes the
data for the three experienced subjects. The
mean values for each position were obtained
by first subtracting each individual subject's
threshold for the diotic signal vs. diotic
speech babble condition, and then grouping
the data. The results in Figures 3-4 show a
greater intelligibility advantage as the signal
is moved to either side of the head; the
advantage is maximal between 60° - 90°
and 270° - 300°. These are locations where
both head-shadowing is maximized, and 
where the binaural interaction advantage 
mechanism is given maximal time 
differences. 
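
The baseline correction used for Figures 3-4 can be illustrated with a short sketch. All names and threshold values here are made up for illustration, and the sign convention (positive means better than diotic) is an assumption.

```python
def advantage_re_diotic(thresholds_db):
    """Express one subject's thresholds relative to their own diotic threshold.

    thresholds_db maps a condition label (an azimuth in degrees, or "diotic")
    to a threshold in dB. A positive result means the call sign was
    identified at a lower level than in the diotic condition (sign
    convention assumed for this sketch).
    """
    baseline = thresholds_db["diotic"]
    return {cond: baseline - value
            for cond, value in thresholds_db.items() if cond != "diotic"}


def group_mean(per_subject):
    """Average the baseline-corrected values across subjects, per condition."""
    conditions = per_subject[0].keys()
    return {c: sum(s[c] for s in per_subject) / len(per_subject)
            for c in conditions}


# Made-up thresholds for two subjects and two azimuths (not study data).
subjects = [
    {"diotic": -10.0, 90: -17.0, 180: -11.0},
    {"diotic": -8.0, 90: -14.0, 180: -8.0},
]
means = group_mean([advantage_re_diotic(s) for s in subjects])
```

Subtracting each subject's own diotic threshold before averaging removes overall differences in sensitivity between subjects, so the group means reflect only the effect of spatialization.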



Figure 3. Data for the naive subject group (4 
males, 1 female). The mean value for the 
diotic signal condition was subtracted from
each spatialized signal value. Standard 
deviation bars were based on the 10 
staircase solutions obtained for each 
condition. 

Figure 4. Data for the experienced subject
group (see Figure 3). 
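
Each staircase solution referred to in Figures 3-4 is one run of the transformed up-down procedure of Section 2.3. A minimal sketch of such a run follows. It assumes the rule that converges on the 70.7% point (Levitt, 1970): the level is lowered after two consecutive correct responses and raised after each incorrect response, since p x p = 0.5 gives p = 0.707. The step-halving schedule, the stopping rule, the averaging of reversal levels, and the toy listener are all assumptions and are not taken from the experimental software.

```python
def run_staircase(respond, start_db=0.0, max_step=6.0, min_step=1.0,
                  n_reversals=8):
    """Track the 70.7%-correct level with a transformed up-down rule.

    respond(level_db) -> True if the call sign was identified correctly.
    The step-halving schedule, the stopping rule and the averaging of
    reversal levels are assumptions made for this sketch.
    """
    level, step = start_db, max_step
    correct_in_a_row = 0
    last_direction = 0          # -1 = level went down, +1 = level went up
    reversal_levels = []
    while len(reversal_levels) < n_reversals:
        if respond(level):
            correct_in_a_row += 1
            if correct_in_a_row < 2:
                continue        # need two correct in a row before lowering
            correct_in_a_row = 0
            direction = -1
        else:
            correct_in_a_row = 0
            direction = +1
        if last_direction and direction != last_direction:
            reversal_levels.append(level)
            step = max(min_step, step / 2.0)
        last_direction = direction
        level += direction * step
    # Average the later reversals, after the step size has settled.
    settled = reversal_levels[len(reversal_levels) // 2:]
    return sum(settled) / len(settled)


# Deterministic toy listener: always correct at or above -12 dB.
threshold = run_staircase(lambda level_db: level_db >= -12.0)
```

In an interleaved design, several such staircases (one per spatial condition) would be advanced in random order within a block, so that a subject cannot anticipate the condition of the next trial.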

Figure 5 summarizes Figures 3-4, by 
showing the mean values for symmetrical 
left-right positions about the head. This 
suggests that, without reference to which 
side a sound is spatialized, the preferred 
order for HRTF-processing for maximal 
intelligibility is 60° or 90°, then 120°, then 
30°, then 150°, and finally 180°. The latter is 
hardly better than performance with the 
diotic stimuli. Figure 5 also shows that the 
three experienced subjects achieved about a 
1 dB additional intelligibility advantage over 
the five naive subjects. However, an 
analysis of variance revealed that no 
significant difference existed between these 
two subject categories, F(1,6) = 2.90, p =
0.14. 
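
The left-right collapsing used for Figure 5 can be sketched as follows. The function name and the numbers are illustrative only and are not the values plotted in the figure; the sketch assumes that the mirror image of an azimuth is 360° minus that azimuth.

```python
def collapse_left_right(advantage_by_azimuth):
    """Average each azimuth with its mirror image (360 - azimuth).

    Azimuths are in degrees; 180 has no mirror image and is kept as is.
    """
    collapsed = {}
    for azimuth, value in advantage_by_azimuth.items():
        key = min(azimuth, 360 - azimuth)
        collapsed.setdefault(key, []).append(value)
    return {key: sum(vals) / len(vals)
            for key, vals in sorted(collapsed.items())}


# Illustrative numbers only, not the values plotted in Figure 5.
example = {30: 3.0, 90: 6.0, 180: 0.5, 270: 7.0, 330: 4.0}
folded = collapse_left_right(example)
```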

The mean values for four of the naive 
subjects had a pattern that followed the 
symmetrical trend of the overall mean shown 
in Figure 3; there seemed to be no preferred 
side to hear the signal. Contrasting this, the 
responses of one of the naive subjects had 
an asymmetrical trend, favoring right side 
positions over left side positions. This trend
was similar to that of a potential subject whose data
was excluded from the subject pool and the 
analysis above due to hearing loss at the left 
ear (between 20 - 35 dB HL at 4, 6, 8 and 
12 kHz). 


Figure 5. Mean values from Figures 3-4
collapsed about symmetrical left-right
positions. Horizontal axis: azimuth of
spatialized signal (mean of left and right
sides); separate curves for naive and
experienced subjects.


Figure 6 shows the results for these two 
subjects, along with the overall means from 
the naive subject group. Except for the 60° 
azimuth position, both of these subjects had 
a smaller advantage for left side positions 
compared to the overall mean, and right side 
positions show a greater advantage. 
Additional data would be needed to 
determine if there was a significant effect 
due to handedness or other factors 
(Deutsch, 1983). Nevertheless, a person 
with asymmetrical hearing loss similar to that 
experienced by the subject shown in Figure 
6 could still benefit from using a 3-D auditory 
display. Gabriel, Koehnke and Colburn 
(1991) and Perrott, Sadralodabi, Saberi and 
Strybel (1991) have pointed out that, 
excluding severe hearing loss, no apparent 
relation between audiometric measurements 
and binaural performance can be 
established. 

Figure 6. Two subjects (one from the naive
group with normal hearing, one subject with
asymmetrical hearing loss) who tended to
favor the right side positions over the left.
Overall means for the naive subject group
(from Figure 3) shown for comparison.


4. DISCUSSION 

Overall, a 6-7 dB advantage for left and right 
60° and 90° positions was found in the 
present study, which exceeds the binaural 
advantage cited in Zurek's model (1993) by 
1-2 dB. This means that headphone listening
with static spatial positions through the
hardware prototype is at least as good as
binaural listening by a normal-hearing
listener who is free to move their head.
Although Bronkhorst and
Plomp (1988) found a 10 dB advantage for a
signal at 0° azimuth and speech-spectrum
noise at 90°, their results are not directly
comparable to those found here since both 
signal and noise were HRTF-filtered by their 
mannequin head, and in the present study 
the noise portion of the stimulus was diotic. 
The additional release from masking they 
found may have been attained through 
either HRTF-filtering of both signal and 
noise, the use of noise rather than speech 
babble, or both. 

The results found here are limited by the fact 
that only one male speaker was used for the 
signal portion of the stimulus. In spite of the 
care taken in preparing the stimulus through
digital editing, there is the potential that
extraneous variation was introduced into the 
results because of the variability of spoken 
intelligibility (ANSI, 1989). Furthermore, the 
average spectrum of this particular speaker 
might have interacted differently with the 
HRTF filtering than that of another speaker 
(e.g., a female voice). Finally, the variability 
in HRTF measurements from different 
persons or reconstruction techniques could 
influence the results of any experiment that 
uses only one set of HRTFs. This is one 
reason the prototype was designed to allow 
interchangeable EPROMs: individuals could
tailor systems to their best advantage by 
using a preferred set of HRTFs. 

5. CONCLUSION 

The advantage of a binaural auditory display 
for multiple communication channels has 
been demonstrated, through a case study of 
a single signal at incremented 30° azimuth 
positions against a diotic, speech babble 
noise source. The 6-7 dB advantage for 60°
and 90° HRTF-filtered speech represents a
reduction to roughly one quarter to one fifth
of the intensity (acoustic power), or about
half the sound pressure, necessary for
correctly identifying the four letter call signs
typical of those used in
communication systems at KSC. This
reduction in the likelihood of misinterpreting 
call signs over communication systems is an 
important safety improvement for "high 
stress", human-machine interface contexts. 
The binaural advantage could also benefit 
communications personnel because the 
overall intensity of communications 
hardware could be reduced without 
sacrificing intelligibility. Lower listening levels 
over headphones could possibly reduce the 
risk of threshold shifts, the Lombard Reflex 
(raising the intensity of one's own voice; see 
Junqua, 1993), and overall fatigue, thereby 
making additional contributions to safety. 
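
The size of a decibel advantage can be related to power and sound pressure ratios with the standard conversions. A brief sketch, with illustrative function names:

```python
def power_ratio(level_db):
    """Ratio of acoustic powers corresponding to a level difference in dB."""
    return 10.0 ** (level_db / 10.0)


def amplitude_ratio(level_db):
    """Ratio of sound pressures (amplitudes) for the same level difference."""
    return 10.0 ** (level_db / 20.0)


for advantage_db in (3.0, 6.0, 7.0):
    print(f"{advantage_db:.0f} dB: power x{power_ratio(advantage_db):.2f}, "
          f"amplitude x{amplitude_ratio(advantage_db):.2f}")
```

A 3 dB difference corresponds to a factor of about two in power, and a 6 dB difference to a factor of about four in power, or about two in sound pressure.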

Overall, the findings here suggest that the 
use of a spatial auditory display could 
enhance both occupational and operational 
safety and efficiency of NASA operations. 
Additional studies are underway at Ames to 
simulate other applications scenarios within 
speech intelligibility experiments to 
determine the additional benefits, if any, of 
spatial audio communications displays. 